[inspection-service] opens immediately #19488

Merged
rustielin merged 1 commit into main from rustielin/inspection-service-fast
Apr 23, 2026
@rustielin rustielin commented Apr 17, 2026

Description

The node inspection service (port 9101, /metrics) previously started after RocksDB
initialization, which can take 1–2+ minutes on validators. This meant Prometheus scrapers
received connection refused for the entire RocksDB open phase, delaying metric visibility
by 1–2 minutes plus up to one additional scrape interval (30s).

The root cause was that start_node_inspection_service was called at the end of
setup_environment_and_start_node, after initialize_database_and_checkpoints,
setup_networks_and_get_interfaces, and start_state_sync_and_get_notification_handles
had all completed.

The /metrics endpoint has no dependency on any of those subsystems — it only reads from
the global Prometheus registry. Only /peer_information actually needs AptosDataClient
and PeersAndMetadata.

Changes:

  • Added InspectionServiceComponents to aptos-inspection-service — an
    Arc-shareable struct holding RwLock<Option<AptosDataClient>> and
    RwLock<Option<Arc<PeersAndMetadata>>>. Both fields start as None and are
    filled in via components.set(...) once the rest of the node finishes initializing.
  • start_inspection_service now takes Arc<InspectionServiceComponents> instead of
    the live values directly.
  • handle_peer_information_request accepts Option<...> types and returns
    503 Service Unavailable ("Node is still initializing") when either value is None,
    so clients know to retry rather than treating it as a hard error.
  • In aptos-node/src/lib.rs, start_node_inspection_service is called immediately after
    start_admin_service — before RocksDB opens. After state sync completes and
    aptos_data_client is available, inspection_components.set(...) is called to unlock
    full endpoint functionality.
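
The late-binding pattern described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `DataClient` and `PeersAndMetadata` here are hypothetical stand-ins for the real types, `snapshot` is an invented helper, and the handler returns a plain `(status, body)` tuple instead of an HTTP response.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for AptosDataClient and PeersAndMetadata.
#[derive(Clone, Debug, PartialEq)]
pub struct DataClient(pub String);

#[derive(Debug, PartialEq)]
pub struct PeersAndMetadata(pub Vec<String>);

/// Container that starts empty and is filled in once the node has initialized.
#[derive(Default)]
pub struct InspectionServiceComponents {
    data_client: RwLock<Option<DataClient>>,
    peers_and_metadata: RwLock<Option<Arc<PeersAndMetadata>>>,
}

impl InspectionServiceComponents {
    /// Creates an empty, shareable container (both slots are None).
    pub fn new() -> Arc<Self> {
        Arc::new(Self::default())
    }

    /// Injects the late-bound values. A second call overwrites the first.
    pub fn set(&self, data_client: DataClient, peers: Arc<PeersAndMetadata>) {
        *self.data_client.write().unwrap() = Some(data_client);
        *self.peers_and_metadata.write().unwrap() = Some(peers);
    }

    /// Hypothetical helper: clones out the current contents of both slots.
    pub fn snapshot(&self) -> (Option<DataClient>, Option<Arc<PeersAndMetadata>>) {
        let data_client = self.data_client.read().unwrap().clone();
        let peers = self.peers_and_metadata.read().unwrap().clone();
        (data_client, peers)
    }
}

/// Simplified handler: returns (status code, body) rather than an HTTP response.
pub fn handle_peer_information_request(
    data_client: Option<DataClient>,
    peers: Option<Arc<PeersAndMetadata>>,
) -> (u16, String) {
    match (data_client, peers) {
        (Some(dc), Some(pm)) => (200, format!("client={} peers={}", dc.0, pm.0.len())),
        // Either value missing means the node has not finished starting up.
        _ => (503, "Node is still initializing".to_string()),
    }
}
```

In this sketch, a request handled before `set(...)` yields 503 and the same request after `set(...)` yields 200, which mirrors the behavior the description gives for /peer_information.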

Result: Port 9101 opens seconds into node startup. aptos_dkg_public_params_source
(and all other boot-time metrics) are scrapeable within the first vmagent poll after
the process starts, instead of after RocksDB finishes.

How Has This Been Tested?

  • cargo check -p aptos-inspection-service and cargo check -p aptos-node — clean,
    no warnings introduced by this change.
  • cargo test -p aptos-inspection-service — all 13 tests pass, including
    test_inspect_peer_information which exercises the fully-initialized and
    disabled-endpoint paths.

Key Areas to Review

  • InspectionServiceComponents::set is called exactly once, after state sync returns.
    There is no guard against calling it twice; a second call would silently overwrite the
    first. This is intentional — the call site is a single sequential code path — but worth
    noting.
  • RwLock choice: std::sync::RwLock (not tokio::sync::RwLock) is used because
    the read path runs inside an async handler but only holds the lock for a clone, making
    it safe to use the sync variant without risking executor starvation.
  • 503 vs other codes for /peer_information during init: 503 was chosen (over alternatives such as 425)
    because it is the conventional "retry later" signal for infrastructure scrapers and
    health checks.
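
As a rough illustration of the RwLock point above, the read path can clone the value out and release the guard before doing any further (possibly async) work. The helper name `clone_out` is hypothetical and does not come from the PR; the sketch only assumes a `std::sync::RwLock<Option<T>>` slot like the ones described.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical helper: copies the current value out of the slot.
/// The read guard is a temporary that is dropped at the end of the `let`
/// statement, so the lock is not held once this function returns.
pub fn clone_out<T: Clone>(slot: &RwLock<Option<T>>) -> Option<T> {
    let value = slot.read().unwrap().clone();
    value
}

/// Example slot type: cloning an Arc only bumps a reference count,
/// so the critical section stays very short.
pub type SharedSlot = RwLock<Option<Arc<String>>>;
```

Because the guard never outlives the helper, an async handler that calls it does not carry the lock across an `.await` point, which is the property the note above relies on.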

Type of Change

  • Performance improvement
  • Refactoring

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)

Note

Medium Risk
Changes node startup ordering and inspection-service initialization by starting the HTTP server before storage/state sync are ready and injecting dependencies later; mistakes could break early boot observability or cause panics if components are set incorrectly.

Overview
Starts the node inspection service earlier in aptos-node (immediately after the admin service, before RocksDB/state sync) so /metrics is scrapeable from the first moments of startup.

Refactors aptos-inspection-service to accept a shared InspectionServiceComponents container that is populated later via set(...); endpoints that require late-bound values (notably /peer_information) now take Option inputs and return 503 Service Unavailable until the components are injected, while tests are updated to construct and pre-populate the new components.

Reviewed by Cursor Bugbot for commit 7490bff.

@rustielin rustielin requested a review from ibalajiarun April 17, 2026 22:13
@rustielin rustielin marked this pull request as ready for review April 17, 2026 22:14
@rustielin rustielin requested a review from JoshLind as a code owner April 17, 2026 22:14
@rustielin rustielin requested a review from sionescu April 17, 2026 22:14
/// Holds the components that are injected into the inspection service after it starts.
/// Uses `RwLock<Option<T>>` so the service can start before these are available.
pub struct InspectionServiceComponents {
    pub data_client: RwLock<Option<AptosDataClient>>,
nit: this can be a OnceCell?
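
For context on this suggestion, a write-once slot could look roughly like the sketch below. It uses `std::sync::OnceLock` from the standard library as one possible once-cell type; the PR thread does not say which cell type the reviewer had in mind or whether the suggestion was adopted. `OnceComponents` and the `PeersAndMetadata` stand-in are hypothetical names for illustration only.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for the real PeersAndMetadata type.
#[derive(Debug, PartialEq)]
pub struct PeersAndMetadata(pub Vec<String>);

/// Hypothetical variant of the components container using a write-once cell.
#[derive(Default)]
pub struct OnceComponents {
    peers_and_metadata: OnceLock<Arc<PeersAndMetadata>>,
}

impl OnceComponents {
    /// Stores the value on the first call and returns true.
    /// Later calls leave the stored value untouched and return false.
    pub fn set(&self, peers: Arc<PeersAndMetadata>) -> bool {
        self.peers_and_metadata.set(peers).is_ok()
    }

    /// Returns a clone of the stored Arc, or None while still uninitialized.
    pub fn peers(&self) -> Option<Arc<PeersAndMetadata>> {
        self.peers_and_metadata.get().cloned()
    }
}
```

Compared with `RwLock<Option<T>>`, a second `set` in this sketch is rejected rather than silently overwriting the first value, and reads need no lock guard.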

@rustielin rustielin force-pushed the rustielin/inspection-service-fast branch from 395dc7e to 7f4ee1b Compare April 23, 2026 15:29
@rustielin rustielin force-pushed the rustielin/inspection-service-fast branch from 7f4ee1b to 7490bff Compare April 23, 2026 16:13
@rustielin rustielin enabled auto-merge (squash) April 23, 2026 16:28
@github-actions

✅ Forge suite compat success on ca049383dd80675149ef2d0042668964f9f9107a ==> 7490bffe318051bdf86d65d6bf00af35525029a1

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 7490bffe318051bdf86d65d6bf00af35525029a1 (PR)
1. Check liveness of validators at old version: ca049383dd80675149ef2d0042668964f9f9107a
compatibility::simple-validator-upgrade::liveness-check : committed: 14541.50 txn/s, latency: 2376.87 ms, (p50: 2100 ms, p70: 2400, p90: 4000 ms, p99: 5600 ms), latency samples: 478400
2. Upgrading first Validator to new version: 7490bffe318051bdf86d65d6bf00af35525029a1
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6251.01 txn/s, latency: 5403.73 ms, (p50: 5900 ms, p70: 6000, p90: 6200 ms, p99: 6400 ms), latency samples: 215860
3. Upgrading rest of first batch to new version: 7490bffe318051bdf86d65d6bf00af35525029a1
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6250.81 txn/s, latency: 5374.01 ms, (p50: 5900 ms, p70: 6100, p90: 6200 ms, p99: 6300 ms), latency samples: 216340
4. upgrading second batch to new version: 7490bffe318051bdf86d65d6bf00af35525029a1
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10674.11 txn/s, latency: 3029.49 ms, (p50: 3200 ms, p70: 3300, p90: 3400 ms, p99: 3700 ms), latency samples: 356320
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 7490bffe318051bdf86d65d6bf00af35525029a1 passed
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on 7490bffe318051bdf86d65d6bf00af35525029a1

two traffics test: inner traffic : committed: 14385.00 txn/s, latency: 1253.88 ms, (p50: 1200 ms, p70: 1300, p90: 1500 ms, p99: 1900 ms), latency samples: 5372740
two traffics test : committed: 100.00 txn/s, latency: 753.37 ms, (p50: 700 ms, p70: 800, p90: 900 ms, p99: 1100 ms), latency samples: 1640
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 0.465, avg: 0.433", "ConsensusProposalToOrdered: max: 0.125, avg: 0.121", "ConsensusOrderedToCommit: max: 0.218, avg: 0.201", "ConsensusProposalToCommit: max: 0.341, avg: 0.322"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.85s no progress at version 66300 (avg 0.06s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.59s no progress at version 5470518 (avg 0.49s) [limit 16].
Test Ok

@github-actions

✅ Forge suite framework_upgrade success on ca049383dd80675149ef2d0042668964f9f9107a ==> 7490bffe318051bdf86d65d6bf00af35525029a1

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 7490bffe318051bdf86d65d6bf00af35525029a1 (PR)
Upgrade the nodes to version: 7490bffe318051bdf86d65d6bf00af35525029a1
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2002.79 txn/s, submitted: 2010.49 txn/s, failed submission: 7.70 txn/s, expired: 7.70 txn/s, latency: 1506.83 ms, (p50: 1200 ms, p70: 1500, p90: 1900 ms, p99: 11100 ms), latency samples: 182141
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2318.01 txn/s, submitted: 2324.33 txn/s, failed submission: 6.31 txn/s, expired: 6.31 txn/s, latency: 1468.06 ms, (p50: 1200 ms, p70: 1300, p90: 1800 ms, p99: 11100 ms), latency samples: 198322
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 7490bffe318051bdf86d65d6bf00af35525029a1 passed
Upgrade the remaining nodes to version: 7490bffe318051bdf86d65d6bf00af35525029a1
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2322.08 txn/s, submitted: 2328.57 txn/s, failed submission: 6.49 txn/s, expired: 6.49 txn/s, latency: 1269.92 ms, (p50: 1200 ms, p70: 1500, p90: 1800 ms, p99: 2200 ms), latency samples: 207602
Test Ok

@rustielin rustielin merged commit 3660486 into main Apr 23, 2026
139 of 144 checks passed
@rustielin rustielin deleted the rustielin/inspection-service-fast branch April 23, 2026 18:18